Differential Expression using DESeq2/edgeR

Last week we talked about Experimental Design. This week we’re going to apply that knowledge to finding differentially expressed genes or transcripts - those genes or transcripts that are expressed differently between groups in our experimental design.

That is, in general, are there genes whose counts differ between the samples in group A and the samples in group B?

We’ll explore the DESeq2 package as one method for finding differentially expressed genes; the edgeR package is another.

Am I doing this right?

If you are new to differential expression, it is worth talking with the Bioinformatics Core about your experimental design and how to set up your model matrix and contrasts.

You did consult with them about your experimental design before you sent the samples off to be processed, I hope.

Count Data is Discrete Data

What does it mean to say a gene is differentially expressed between sets of samples when we are working with count data?

Last week, we talked about uneven library sizes and how they can affect our expression estimates for each sample. Our solution to this was to scale each sample’s expression estimates so they had the same library size.
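As a toy sketch of that scaling (the counts and size factors here are made up, not the real GSE96870 values):

```r
# Hypothetical raw counts for one gene across three samples
counts <- c(s1 = 50, s2 = 100, s3 = 75)

# Hypothetical size factors, one per sample (relative library sizes)
size_factors <- c(s1 = 0.5, s2 = 1.0, s3 = 0.75)

# Dividing by the size factor puts all samples on a common scale
scaled <- counts / size_factors
scaled   # all three samples become 100
```

This is the same `counts / sizeFactor` calculation used throughout the chunks below.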

tidySummarizedExperiment says: A data frame is returned for independent data analysis.

Warning in DESeq2::DESeqDataSet(GSE96870_filtered, design = ~sex + time): some
variables in design formula are characters, converting to factors
# A SummarizedExperiment-tibble abstraction: 603,460 × 9
# Features=27430 | Samples=22 | Assays=counts
   .feature     .sample    counts sizeFactor seqnames  start    end width strand
   <chr>        <chr>       <int>      <dbl> <fct>     <int>  <int> <int> <fct> 
 1 Xkr4         GSM2545337   2410      0.909 1        3.67e6 3.67e6  1191 -     
 2 LOC105243853 GSM2545337      0      0.909 1        3.36e6 3.37e6  9183 +     
 3 LOC105242387 GSM2545337    121      0.909 1        3.66e6 3.67e6 11610 -     
 4 LOC105242467 GSM2545337      5      0.909 1        4.23e6 4.23e6   293 -     
 5 Rp1          GSM2545337      2      0.909 1        4.41e6 4.41e6    72 -     
 6 Sox17        GSM2545337    239      0.909 1        4.50e6 4.50e6  1064 -     
 7 Gm7357       GSM2545337      0      0.909 1        4.52e6 4.52e6   775 +     
 8 LOC105243855 GSM2545337      3      0.909 1        4.59e6 4.59e6   801 -     
 9 Gm6123       GSM2545337      7      0.909 1        4.77e6 4.77e6  1289 +     
10 Mrpl15       GSM2545337   1019      0.909 1        4.79e6 4.79e6   154 -     
# ℹ 40 more rows

As you can see, some, but not all, of the intra-timepoint variation was removed, especially within the Day0 timepoint. We will attempt to account for this variability in our model.

This week we’ll talk about the model fitting procedure.

What groups are we going to compare?

In our analysis, we are going to compare the two timepoints in our study: Day0 and Day8.

We want to find genes that have expression differences between these two timepoints.

The General Differential Expression Workflow

In the general workflow, we are in the box below.

graph TD
A[Reading in Data] --> B
B[Metadata] --> C
C[Data Exploration] --> D
subgraph "DESeq2"
D[Filtering and Normalization] --> E
E[Design Matrix] --> F
F[Dispersion Estimation] --> G
G[Differential Expression]
end
G --> H[Annotation and Gene Set Analysis]
G --> I[Clustering and Visualization]
G --> J[Pathway and Network Analysis]

Filtering and Normalization

We already talked about filtering out low expression counts and normalizing by library size in the previous week.

Models as Filters

When we calculate differential expression, we use the model fit as a filter for good and bad candidates. Our statistical test generates a p-value, and we use this p-value as a filtering criterion.

We have two ways of filtering candidates:

  1. Filter by Mean Fold Change in expression
  2. Filter by model fit (p-value)
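As a sketch of how the two filters combine (the results table `res` here is made up, but it uses the column names that DESeq2’s results() produces):

```r
# Hypothetical differential expression results
res <- data.frame(
  gene           = c("geneA", "geneB", "geneC", "geneD"),
  log2FoldChange = c(2.1, 0.2, -1.8, 3.0),
  padj           = c(0.001, 0.04, 0.3, 0.02)
)

# Filter 1: effect size - keep |log2 fold change| > 1
# Filter 2: model fit  - keep adjusted p-value < 0.05
candidates <- subset(res, abs(log2FoldChange) > 1 & padj < 0.05)
candidates$gene   # geneA and geneD pass both filters
```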

So far, so good. What are some things that might make the modeling difficult?

  1. Variation in library size (handled by normalization)
  2. Low expression counts or zero expression counts (handled by modeling)

We’ll explore the distinctions the model makes in the next section.

Fold Change

Fold change represents the effect size, or our expression difference.

We calculate fold change by dividing the mean expression for group 2 (Day8) by the mean expression for group 1 (Day0).

For example, for Asl:

GSE_means <- GSE96870_normalized |>
  filter(time %in% c("Day0", "Day8")) |>
  filter(.feature == "Asl") |>
  mutate(scaled = counts / sizeFactor, time=factor(time)) |>
  group_by(time) |>
  summarize(mean_expression = mean(scaled))
tidySummarizedExperiment says: A data frame is returned for independent data analysis.
GSE96870_normalized |>
  filter(.feature == "Asl") |>
  filter(time %in% c("Day0", "Day8")) |>
  mutate(scaled = counts / sizeFactor) |>
  ggplot() +
  geom_point(aes(x=time, y=scaled, color=time)) +
  geom_errorbar(mapping=aes(x=time,
                            ymin=mean_expression,
                            ymax=mean_expression),
                data=GSE_means) +
  annotate(geom="text", x=1.2, y=490, label="Mean Day0") +
  annotate(geom="text", x=2.2, y=1030, label="Mean Day8")

Let’s look at the mean expression by time, which we computed above:

GSE_means
# A tibble: 2 × 2
  time  mean_expression
  <fct>           <dbl>
1 Day0             466.
2 Day8            1012.

To calculate the fold change, we divide the Day8 mean expression (1012) by the Day0 mean expression (466), getting a fold change of about 2.2, or a log2 fold change of about 1.1. A positive log2 fold change means that mean(Day0) < mean(Day8).

Note that a negative log2 fold change means the difference is in the opposite direction: mean(Day0) > mean(Day8).
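The calculation above can be checked by hand, using the rounded means from the table:

```r
# Rounded group means for Asl from the table above
mean_day0 <- 466
mean_day8 <- 1012

fold_change <- mean_day8 / mean_day0   # about 2.2
log2_fc     <- log2(fold_change)       # about 1.1; positive, so mean(Day8) > mean(Day0)

c(fold_change = fold_change, log2_fc = log2_fc)
```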

Volcano Plot

Below is a Shiny app that lets you explore the two groups in our data, Day0 and Day8. Be patient - it takes a moment to load.

Volcano plots are one way of displaying the results of a differential expression analysis. They can be a little hard to read because both axes are on a log scale, so let’s review what the axes represent.

  1. On our x-axis, the log2 fold change. A handy rule of thumb: a fold change of 2 corresponds to log2(2) = 1 on this axis.
  2. On our y-axis, the -log2 p-value. The thing to remember is that as -log2(p-value) goes up, the p-value goes down.

We often put vertical lines at log2 fold change = 1 and -1, and a horizontal line at -log2(0.05), our threshold for the p-value.
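Those threshold positions can be computed directly:

```r
# A 2-fold change sits at +/- 1 on the log2 fold change axis
fc_line <- log2(2)      # 1

# A p-value of 0.05 on the -log2 scale
p_line <- -log2(0.05)   # about 4.32

c(fc_line = fc_line, p_line = p_line)
```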

Try the following out:

  1. Pick a candidate with a high fold change and a high -log2(p-value) and click on it. Do you agree that it is a good expression candidate? Look at the overlap in expression values.
  2. Pick a couple of candidates with a high fold change but a low -log2(p-value). Do you agree they are bad candidates? Why?
  3. Pick one of the lowest p-value candidates and look at the expression differences.

It’s important to have an intuition for how the model is discriminating between expression candidates.

Model Fitting

The model we use in DESeq2 fits a negative binomial distribution to the data. This is because:

  1. Our data begins at 0 and goes up
  2. It is skewed (most of the genes have very low expression)

The main thing we are fitting is the dispersion of the data, which plays a role similar to the variance. The key thing to remember is that lower expression values are less reliable than higher expression values.
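Concretely, in the negative binomial model the variance grows with the mean through the dispersion alpha: Var = mu + alpha * mu^2. A quick simulation (mu and alpha are made-up values) shows this over-dispersion:

```r
set.seed(42)
mu    <- 100   # hypothetical mean expression
alpha <- 0.2   # hypothetical dispersion

# R's rnbinom() parameterizes dispersion as size = 1 / alpha
x <- rnbinom(1e5, mu = mu, size = 1 / alpha)

mean(x)   # close to mu = 100
var(x)    # close to mu + alpha * mu^2 = 2100, far larger than the mean
```

For a Poisson model the variance would equal the mean; the extra alpha * mu^2 term is what lets the negative binomial accommodate the biological variability we see between replicates.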

We use estimateDispersions() in DESeq2 to fit our model.

GSE96970_est <- DESeq2::estimateDispersions(GSE96870_normalized)
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
plotDispEsts(GSE96970_est)

Let’s dive more into this plot. Each point represents the dispersion estimate for one gene, plotted against its mean of normalized counts.

Our model fits the overall dispersion across all genes and all mean expression counts.

Our starting gene-wise estimates are shown in black. After fitting, the final estimates are shown in blue. Notice that the dispersion estimates are “shrinking”, especially for those candidates that had high dispersion.

Shrinkage

The model shrinks the gene-wise dispersion estimates toward the fitted trend; estimates from lowly expressed genes are the least reliable, so they are affected the most.

Genes with high dispersion experience the most shrinkage, and they pay a penalty in that their fold-change estimates shrink as well. The crucial thing to note is that low-dispersion candidates experience very little shrinkage.

After model fitting and shrinkage, the next step is to actually conduct the statistical test.

model_fit <- DESeq2::nbinomWaldTest(GSE96970_est)

After running the statistical test, we need to extract the comparison we’re interested in using the results() function. We specify the comparison and groups of interest with the contrast argument, and can also specify a log fold change threshold.

GSE_results <- DESeq2::results(model_fit, 
                               contrast = c("time", "Day8", "Day0"), 
                               lfcThreshold = 0.5,
                               alpha = 0.05)
summary(GSE_results)

out of 27430 with nonzero total read count
adjusted p-value < 0.05
LFC > 0.50 (up)    : 261, 0.95%
LFC < -0.50 (down) : 220, 0.8%
outliers [1]       : 10, 0.036%
low counts [2]     : 2128, 7.8%
(mean count < 1)
[1] see 'cooksCutoff' argument of ?results
[2] see 'independentFiltering' argument of ?results
resultsNames(GSE_results)
character(0)

Note that this returns character(0) because resultsNames() must be called on the fitted model object (model_fit here), not on the results table. Called on the model object, it lists the coefficient names that lfcShrink()’s coef argument expects.
shrinkage <- lfcShrink(model_fit, coef="time_Day8_vs_Day0", lfcThreshold = 0.5)
using 'apeglm' for LFC shrinkage. If used in published research, please cite:
    Zhu, A., Ibrahim, J.G., Love, M.I. (2018) Heavy-tailed prior distributions for
    sequence count data: removing the noise and preserving large differences.
    Bioinformatics. https://doi.org/10.1093/bioinformatics/bty895
computing FSOS 'false sign or small' s-values (T=0.5)

We can visualize the shrinkage directly by plotting fold change vs. p-value, before (black) and after (blue) shrinkage.

set.seed(111)

after <- shrinkage |>
  as.data.frame() |>
  select(after_fit = log2FoldChange, svalue)

before <- GSE_results |>
  as.data.frame() |>
  select(before_fit = log2FoldChange, pvalue, padj)

both <- bind_cols(before, after)

small_both <- both[sample(nrow(both), 300),]

small_both |>
  ggplot() +
  geom_point(aes(x=pvalue, y=before_fit), color="black") +
  geom_point(aes(x=pvalue, y=after_fit), color="blue") +
  geom_segment(aes(x=pvalue, y=before_fit, yend=after_fit),arrow=arrow(length = unit(2, "mm"))) +
  xlab("p-value") + ylab("fold change")
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Removed 1 row containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 1 row containing missing values or values outside the scale range
(`geom_segment()`).

Another way to visualize shrinkage is with an MA plot - this plots the log2 fold change on the y-axis against the mean of normalized counts on the x-axis.

plotMA(GSE_results, main= "Before Shrinkage")

Before shrinkage, we can see there are a lot of high fold change candidates at low normalized counts. Compare this to after shrinkage:

plotMA(shrinkage, main="After Shrinkage")
thresholding s-values on alpha=0.005 to color points

We can see that the shrinkage procedure has shrunk many of these high fold-change candidates at lower counts - notice that the low-count region looks “tighter”.

Adjustment for multiple comparisons

The last thing we need to do is adjust for multiple comparisons. We are doing over 20,000 statistical tests, and with each additional test, the probability of making at least one false discovery increases. So we need a way to correct for multiple comparisons.

We adjust our p-value distribution using what’s called an FDR (False Discovery Rate) procedure. Here is the original p-value distribution, and the p-value distribution after adjustment.
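DESeq2’s padj column uses the Benjamini-Hochberg method by default, which is available in base R as p.adjust(). A small example with made-up p-values:

```r
pvals <- c(0.001, 0.010, 0.020, 0.040, 0.500)

# BH multiplies each p-value by n / rank, then enforces monotonicity
padj_vals <- p.adjust(pvals, method = "BH")
round(padj_vals, 4)   # 0.0050 0.0250 0.0333 0.0500 0.5000
```

Notice that every adjusted p-value is at least as large as the original, which is why the adjusted distribution below shifts toward 1.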

before <- GSE_results |>
  ggplot() +
  geom_histogram(aes(x=pvalue), fill="black", alpha=0.5, bins=40) +
  xlab("original p-value") +
  ylab("count")

after <- GSE_results |>
  ggplot() +
  geom_histogram(aes(x=padj), fill="blue", alpha=0.5, bins=40) +
  xlab("adjusted p-value") +
  ylab("count")

# stacking the plots with / requires the patchwork package
before / after
Warning: Removed 10 rows containing non-finite outside the scale range
(`stat_bin()`).
Warning: Removed 2138 rows containing non-finite outside the scale range
(`stat_bin()`).

Note that a lot of p-values get pushed towards a value of 1 - many more than in the unadjusted p-value distribution.

The main thing to understand about this adjustment procedure is that it controls the proportion of false discoveries - candidates that pass our threshold but are not real discoveries. The false discovery rate differs from the false positive rate: it is the expected fraction of false positives among the candidates we call significant, rather than among all tests.

Going back to Volcano Plots

small_both |>
  ggplot() +
  geom_point(aes(y=padj, x=before_fit), color="black") +
  geom_point(aes(y=padj, x=after_fit), color="blue") +
  geom_segment(aes(y=padj, x=before_fit, xend=after_fit),arrow=arrow(length = unit(2, "mm"))) +
  geom_hline(yintercept = 0.05, lty=2) +
  annotate(geom="text", x=0, y=0.1, label="alpha") +
  xlab("log2 fold change")
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_segment()`).

This is beginning to look familiar, isn’t it? If we transform padj by taking -log2, we arrive at our original volcano plot:

small_both |>
  ggplot() +
  geom_point(aes(y=-log2(padj), x=before_fit), color="black") +
  geom_point(aes(y=-log2(padj), x=after_fit), color="blue") +
  geom_segment(aes(y=-log2(padj), x=before_fit, xend=after_fit),arrow=arrow(length = unit(2, "mm"))) +
  geom_hline(yintercept = -log2(0.05), lty=2) +
  annotate(geom="text", x=0, y=0.1, label="alpha") +
  xlab("log2 fold change") +
  geom_vline(xintercept = -1) +
  geom_vline(xintercept = 1)
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Removed 14 rows containing missing values or values outside the scale range
(`geom_point()`).
Warning: Removed 14 rows containing missing values or values outside the scale range
(`geom_segment()`).

This may not look like the volcano plots you have seen before. Most volcano plots show the unadjusted p-values and the unshrunken fold changes:

GSE_results |>
  ggplot() +
  geom_point(aes(y=-log2(pvalue), x=log2FoldChange), color="black", alpha=0.5) +
  geom_hline(yintercept = -log2(0.05), lty=2) +
  annotate(geom="text", x=0, y=0.1, label="alpha") +
  xlab("log2 fold change") +
  geom_vline(xintercept = -1) +
  geom_vline(xintercept = 1)
Warning: Removed 10 rows containing missing values or values outside the scale range
(`geom_point()`).


Take Home Points

  • Fold change represents the effect size - the mean expression value of one group divided by the mean expression value of the other group.
  • We need to filter out genes with low expression (low counts) before we run the analysis. Alternatively, we can select genes to test by using the Coefficient of Variation to find the most variable genes.
  • Count Data must be normalized by the library size to compare across samples.
  • We model data using distributions, and use those models to find expression differences.
  • There are two quantities we adjust during the modeling procedure:
    • Fold change, via our dispersion estimation / shrinkage procedure, which penalizes lower-count candidates by shrinking their fold change. Low-dispersion candidates are not penalized as much by the shrinkage procedure
    • P-values using False Discovery Rate Adjustment, which adjusts for multiple comparisons
  • In the end, we have high fold-change, low p-value candidates for further analysis.
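The Coefficient of Variation mentioned above is just sd/mean per gene; a toy sketch with a made-up count matrix:

```r
# Made-up count matrix: 4 genes x 3 samples
mat <- matrix(c( 10,  12,  11,
                100, 400,  50,
                  5,   5,   5,
                200, 210, 190),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), paste0("s", 1:3)))

# Coefficient of Variation per gene: sd / mean
cv <- apply(mat, 1, sd) / rowMeans(mat)

# The gene most variable relative to its mean tops the list
sort(cv, decreasing = TRUE)
```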